I was a business analyst at a bank, and I kept thinking about how we could make better business decisions based on data. In 2019, I had the opportunity to pursue a master's degree in data and computational science. Then, in September, my job-seeking journey started. While reviewing data analyst job descriptions on LinkedIn, Indeed and Glassdoor, I found myself asking:
"What does a data analyst actually do?"
"What are the KEYWORDS in data analyst job summaries?"
To answer these questions, I used natural language processing (NLP) techniques to analyze the keywords in job summaries for data analyst positions.
How Do We Get the Data Set from Indeed?
We will do a simple scrape with rvest.
# Let's load the packages
library(tidyverse) # clean and tidy the data
library(rvest) # web scraping
library(xml2) # read the html page
# 
Okay, this is where we tailor the URL. Since I am looking for data analyst positions in Ireland, I used ie.indeed.com with the keyword "Data Analyst", which gave me this link as my first page of results: https://ie.indeed.com/jobs?q=Data+Analyst
Step 1. My first page link
Step 2. After installing SelectorGadget, it will show up among your Chrome extensions
Step 3. Select the area that you want
Step 4. Copy the “.summary” selector to use in html_nodes()
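Before looping over every page, it is worth sanity-checking that selector on the first results page. A minimal check (assuming the “.summary” class reported by SelectorGadget is still what Indeed uses):
# Quick check: read the first results page and pull the job summaries
first_page <- xml2::read_html("https://ie.indeed.com/jobs?q=Data+Analyst")
first_page %>%
  rvest::html_nodes(".summary") %>% # the node SelectorGadget suggested
  rvest::html_text() %>%
  stringi::stri_trim_both() %>%
  head()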
I noticed a pattern in Indeed's result-page URLs: the pages are not numbered 1, 2, 3, …; instead, the start parameter increases in multiples of 10. Here is the code to generate those page offsets:
# Specifying the page offsets in the url
start <- 10 # where the offsets start
end <- 1000 # last offset, depends on how much data you want
links <- seq(start, end, by = 10) # returns 10, 20, ..., 1000
Alright, now we loop over the links and store the results in a data frame.
# Make an empty dataframe to store the data
data <- data.frame()
# Let's loop!
# we will process the links one by one, which is why I use the seq_along() function
for(i in seq_along(links)) {
  Initial_page <- "https://ie.indeed.com/jobs?q=analyst&l=Ireland" # the very first page
  url <- paste0(Initial_page, "&start=", links[i]) # construct the url by pasting
  page <- xml2::read_html(url) # read the html
  # Sys.sleep pauses R for two seconds between requests so we do not overload the server
  Sys.sleep(2)
  # right-click on the page and Inspect, or use the SelectorGadget extension on Chrome
  # get the job title
  job_title <- page %>%
    rvest::html_nodes("div") %>%
    rvest::html_nodes(xpath = '//a[@data-tn-element = "jobTitle"]') %>%
    rvest::html_attr("title")
  # get the job location (CSS selector)
  job_location <- page %>%
    rvest::html_nodes('.accessible-contrast-color-location') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()
  # get the company name
  company_name <- page %>%
    rvest::html_nodes("span") %>%
    rvest::html_nodes(xpath = '//*[@class="company"]') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()
  # get the job description (CSS selector)
  job_description <- page %>%
    rvest::html_nodes('.summary') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()
  df <- data.frame(job_title, job_location, company_name, job_description)
  data <- rbind(data, df)
}
Next, since I am only interested in finding a job in Dublin, I kept the unique postings and labelled them with the city Dublin.
# New Dublin Data set
df_IE <- data %>%
  dplyr::distinct() %>%
  dplyr::mutate(city = "Dublin") # add column city = Dublin
# Cleaning
df_IE$job_description <- gsub("[\r\n]", "", df_IE$job_description)
# in case you want to save the dataset into a csv
write.csv(df_IE,"df_IE.csv")
I will use a basic NLP technique that constructs features based on term frequencies computed over a collection of texts, known as a corpus. Such features can also be used to train a classifier.
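As a toy illustration of term-frequency features, consider the two made-up snippets below; each document becomes a row of term counts in a document-term matrix:
library(tm)
toy <- c("analyst with sql experience", "business analyst with excel experience")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy)))
inspect(toy_dtm) # rows are documents, columns are terms, cells are counts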
Job summaries often list the number of years of experience desired, so let's see how the years of experience are distributed in the Dublin data set.
library(extrafont)
df <- df_IE
library(tm)
df$job_description = tolower(df$job_description)
# remove stop words
df$job_description = removeWords(df$job_description, stopwords("english"))
df$job_description = removePunctuation(df$job_description)
df$job_description = removeNumbers(df$job_description)
Colors <- function(n) {
  hues = seq(390, 850, length = n + 1)
  hcl(h = hues, l = 85, c = 70)[1:n]
}
plotYearsExp <- function(text){
  library(stringr)
  # Function to extract numbers from text
  numextract <- function(string){
    str_extract(string, "\\-*\\d+\\.*\\d*")
  }
  # Extract numbers
  years = numextract(text)
  # Remove NAs
  years <- as.numeric(years[!is.na(years)])
  # Relevant limit of years
  years <- years[(years < 10) & (years %% 1 == 0) & (years >= 0)]
  # Colors selection
  colors = Colors(1)
  library(ggpubr)
  p <- gghistogram(data.frame(years),
                   x = "years",
                   fill = colors[1],
                   color = colors[1],
                   alpha = 0.75,
                   binwidth = 1,
                   size = 1.5, ylim = c(0, 20))
  ggpar(p, xlab = "Years", ylab = "Frequency",
        title = "Work Experience Suggested",
        font.x = c(13, "bold"),
        font.y = c(13, "bold"),
        font.title = c(18, "bold"),
        font.subtitle = 16,
        font.family = "Arial",
        font.tickslab = 12, xticks.by = 1,
        orientation = c("vertical", "horizontal", "reverse"),
        ggtheme = theme_pubr()) + font("title", hjust = 0.5)
}
plotYearsExp(unlist(df))
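To see what the number-extraction step inside plotYearsExp() picks out, here is a quick check on a made-up sentence:
library(stringr)
# the same regular expression used by numextract() above
str_extract("Minimum 3 years of experience with SQL", "\\-*\\d+\\.*\\d*") # returns "3"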
library(dplyr)
library(ggplot2)
library(tidytext)
data <- df
Here are the 10 most frequent words and their counts:
# Counting tokens
count = data %>%
  unnest_tokens(output = "word",
                token = "words",
                input = job_description) %>%
  count(word, sort = TRUE)
head(count,10)
## word n
## 1 analyst 393
## 2 team 204
## 3 business 180
## 4 analysts 160
## 5 will 150
## 6 client 134
## 7 dublin 122
## 8 join 112
## 9 data 110
## 10 support 101
#tidytext representation
company_words <- data %>%
  unnest_tokens(output = "word", token = "words",
                input = job_description) %>%
  anti_join(stop_words) %>%
  count(company_name, word, sort = TRUE)
head(company_words,10)
## company_name word n
## 1 Accenture team 19
## 2 Accenture review 18
## 3 Eolas Recruitment analyst 18
## 4 Google advisers 16
## 5 Google analysts 16
## 6 Google analyze 16
## 7 Google closely 16
## 8 Google collaborate 16
## 9 Google customer 16
## 10 Google specialists 16
Before proceeding to classification, I visualized term frequencies and associations. First, I produced a word cloud depicting the 200 most frequent terms, weighted by their frequency:
preprocessing <- function(text){
  library(tm)
  # Create the toSpace content transformer
  toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
  # Perform the necessary operations on the corpus
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, toSpace, "/")
  corpus <- tm_map(corpus, toSpace, "-")
  # Convert to lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Remove stopwords
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  # Stem the corpus
  corpus <- tm_map(corpus, stemDocument, "english")
  # Create the document-term matrix (dtm)
  dtm <- DocumentTermMatrix(corpus)
  # Remove terms with sparsity > 99% and return the dtm
  dtm <- removeSparseTerms(dtm, 0.99)
  dtm
}
# Load the job descriptions
text.summary <- df_IE$job_description
# Create document-term matrices for the summaries
dtm.summary <- preprocessing(text.summary)
# Combine the respective DTMs into one object (here we only have the summary DTM)
dtm <- c(dtm.summary)
set.seed(1111)
plotWordcloud <- function(dtm){
  m <- as.matrix(t(dtm))
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)
  # Wordcloud of 200 most frequently used words
  library(wordcloud2)
  set.seed(123)
  wordcloud2(d)
}
plotWordcloud(dtm.summary)
Some of the terms are stemmed versions of proper English words (e.g. experi instead of experience). Data and analyst were the most frequent terms in the overall corpus.
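For reference, this is what the Snowball stemmer behind tm's stemDocument() does to a couple of these terms (a small illustration using SnowballC directly):
library(SnowballC)
wordStem(c("experience", "analysts"), language = "english") # "experi" "analyst"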
I stored the information scraped from Indeed in a data frame. I also plotted the years of experience found among the numeric values in the scraped job summaries, to compare against the presampled qualifications corpus from Part 1. Finally, I transformed the corpus by removing stopwords, punctuation, and numbers, and converting it to lowercase.
The bag-of-words approach has a pitfall: it is a quick-but-dirty scheme for capturing the available keywords, and it does not always capture the meaning in the appropriate context.
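As a small illustration of that pitfall, the two made-up phrases below mean different things but yield identical bag-of-words counts:
library(dplyr)
library(tidytext)
phrases <- data.frame(doc = c(1, 2),
                      text = c("the analyst supports the business",
                               "the business supports the analyst"))
phrases %>%
  unnest_tokens(output = "word", token = "words", input = text) %>%
  count(doc, word)
# both documents end up with exactly the same counts, so word order (who supports whom) is lost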
I applied the GloVe algorithm to the job summary corpus, examining both unigrams (single terms) and bigrams (pairs of consecutive terms).
library(text2vec)
# Use itoken to form the vocabulary
iteration = itoken(df$job_description,
                   preprocessor = tolower,
                   tokenizer = word_tokenizer)
# We need uni- and bi-grams
vocab <- create_vocabulary(iteration, ngram = c(1L, 2L))
vocab <- prune_vocabulary(vocab,
                          # retain terms that appear at least 15 times
                          term_count_min = 15,
                          doc_proportion_min = 0.001)
# Vectorize the vocab
vocab <- prune_vocabulary(vocab, term_count_min = 15, doc_proportion_min = 0.001)
vectorizer <- vocab_vectorizer(vocab)
# Use a window size of 10 for the context words (captures most document lengths)
tcm <- create_tcm(iteration, vectorizer, skip_grams_window = 10L)
set.seed(123)
# GloVe model
glove <- GloVe$new(rank = 50, x_max = 10)
main <- glove$fit_transform(tcm, n_iter = 50, verbose = FALSE)
## INFO [12:17:20.517] epoch 1, loss 0.3102
## INFO [12:17:20.567] epoch 2, loss 0.1174
## INFO [12:17:20.584] epoch 3, loss 0.0801
## INFO [12:17:20.593] epoch 4, loss 0.0628
## INFO [12:17:20.603] epoch 5, loss 0.0516
## INFO [12:17:20.609] epoch 6, loss 0.0436
## INFO [12:17:20.615] epoch 7, loss 0.0374
## INFO [12:17:20.620] epoch 8, loss 0.0327
## INFO [12:17:20.626] epoch 9, loss 0.0288
## INFO [12:17:20.632] epoch 10, loss 0.0257
## INFO [12:17:20.638] epoch 11, loss 0.0231
## INFO [12:17:20.644] epoch 12, loss 0.0209
## INFO [12:17:20.649] epoch 13, loss 0.0190
## INFO [12:17:20.655] epoch 14, loss 0.0174
## INFO [12:17:20.661] epoch 15, loss 0.0159
## INFO [12:17:20.667] epoch 16, loss 0.0147
## INFO [12:17:20.673] epoch 17, loss 0.0136
## INFO [12:17:20.678] epoch 18, loss 0.0126
## INFO [12:17:20.684] epoch 19, loss 0.0117
## INFO [12:17:20.689] epoch 20, loss 0.0109
## INFO [12:17:20.695] epoch 21, loss 0.0102
## INFO [12:17:20.700] epoch 22, loss 0.0096
## INFO [12:17:20.706] epoch 23, loss 0.0090
## INFO [12:17:20.711] epoch 24, loss 0.0084
## INFO [12:17:20.717] epoch 25, loss 0.0080
## INFO [12:17:20.723] epoch 26, loss 0.0075
## INFO [12:17:20.728] epoch 27, loss 0.0071
## INFO [12:17:20.734] epoch 28, loss 0.0067
## INFO [12:17:20.739] epoch 29, loss 0.0064
## INFO [12:17:20.744] epoch 30, loss 0.0060
## INFO [12:17:20.749] epoch 31, loss 0.0057
## INFO [12:17:20.755] epoch 32, loss 0.0055
## INFO [12:17:20.760] epoch 33, loss 0.0052
## INFO [12:17:20.765] epoch 34, loss 0.0049
## INFO [12:17:20.771] epoch 35, loss 0.0047
## INFO [12:17:20.776] epoch 36, loss 0.0045
## INFO [12:17:20.781] epoch 37, loss 0.0043
## INFO [12:17:20.787] epoch 38, loss 0.0041
## INFO [12:17:20.792] epoch 39, loss 0.0039
## INFO [12:17:20.798] epoch 40, loss 0.0038
## INFO [12:17:20.804] epoch 41, loss 0.0036
## INFO [12:17:20.810] epoch 42, loss 0.0035
## INFO [12:17:20.815] epoch 43, loss 0.0033
## INFO [12:17:20.820] epoch 44, loss 0.0032
## INFO [12:17:20.826] epoch 45, loss 0.0031
## INFO [12:17:20.831] epoch 46, loss 0.0030
## INFO [12:17:20.836] epoch 47, loss 0.0029
## INFO [12:17:20.842] epoch 48, loss 0.0027
## INFO [12:17:20.848] epoch 49, loss 0.0026
## INFO [12:17:20.853] epoch 50, loss 0.0025
context <- glove$components
# final word vectors: combine the main and context components (words as columns)
complete <- t(main) + context
colnames(complete)
## [1] "accounts" "analytics"
## [3] "behalf" "cork"
## [5] "deliver" "developers"
## [7] "european" "every"
## [9] "expanding" "experiences"
## [11] "managers" "together"
## [13] "advisers" "advisers_support"
## [15] "analysts_advisers" "analyze_customer"
## [17] "basis" "closely_spot"
## [19] "collaborate_closely" "customer_needs"
## [21] "insurance" "investment"
## [23] "lab" "minimum"
## [25] "monitoring" "multinational"
## [27] "needs_trends" "project"
## [29] "range" "specialists_collaborate"
## [31] "spot" "spot_analyze"
## [33] "strategists" "strategists_analysts"
## [35] "strong" "support_specialists"
## [37] "team_lead" "teams_strategists"
## [39] "ability" "build"
## [41] "ensure" "excellent"
## [43] "job_title" "key"
## [45] "life" "location"
## [47] "markets" "products"
## [49] "projects" "reports"
## [51] "specialists" "users"
## [53] "will_work" "client_leading"
## [55] "exciting" "industry"
## [57] "title" "career"
## [59] "compliance" "currently_recruiting"
## [61] "laboratory" "leader"
## [63] "years_experience" "city_centre"
## [65] "level" "manage"
## [67] "months" "performance"
## [69] "providing" "support_analyst"
## [71] "us" "dublin_city"
## [73] "one" "provides"
## [75] "recruit" "software"
## [77] "will_responsible" "analytical"
## [79] "analyze" "collaborate"
## [81] "environment" "firm"
## [83] "growing" "innovative"
## [85] "month" "process"
## [87] "sales" "skills"
## [89] "candidate" "city"
## [91] "successful" "training"
## [93] "trends" "business_analysts"
## [95] "digital" "market"
## [97] "office" "required"
## [99] "testing" "content"
## [101] "development" "information"
## [103] "permanent" "manager"
## [105] "part" "technical"
## [107] "service" "centre"
## [109] "financial_services" "help"
## [111] "international" "ireland"
## [113] "recruiting" "across"
## [115] "analysis" "finance"
## [117] "needs" "position"
## [119] "risk" "systems"
## [121] "teams" "based_dublin"
## [123] "clients" "join_team"
## [125] "people" "review"
## [127] "data_analyst" "product"
## [129] "group" "closely"
## [131] "lead" "operations"
## [133] "within" "reporting"
## [135] "job" "management"
## [137] "analyst_will" "qc"
## [139] "customer" "experienced"
## [141] "quality" "years"
## [143] "new" "provide"
## [145] "solutions" "technology"
## [147] "seeking" "working"
## [149] "opportunity" "contract"
## [151] "currently" "responsible"
## [153] "senior" "analyst_join"
## [155] "business_analyst" "leading"
## [157] "based" "financial"
## [159] "company" "experience"
## [161] "services" "work"
## [163] "looking" "global"
## [165] "role" "support"
## [167] "data" "join"
## [169] "dublin" "client"
## [171] "will" "analysts"
## [173] "business" "team"
## [175] "analyst"
plotSimilarWords <- function(word1, word2, complete){
  colors = Colors(2)
  # Finds most similar terms given a target vector among all word vectors
  findSimilarWords <- function(word, word_vectors){
    library(text2vec)
    target = word_vectors[, word, drop = FALSE]
    cos_sim = sim2(t(word_vectors), t(target), method = 'cosine', norm = 'l2')
    similar = head(sort(cos_sim[, 1], decreasing = TRUE), 11)
    similar = similar[-1]
  }
  library(dplyr)
  library(tibble)
  query1 = data.frame(Similarity = findSimilarWords(word1, complete))
  query2 = data.frame(Similarity = findSimilarWords(word2, complete))
  query <- query1 %>%
    rownames_to_column("Term") %>%
    mutate(Selected = word1) %>%
    bind_rows(query2 %>% rownames_to_column("Term") %>%
                mutate(Selected = word2)) %>%
    group_by(Selected) %>%
    arrange(Selected, desc(Similarity)) %>%
    ungroup()
  head(query, 10)
  plot1 <- ggdotchart(query[1:10, ], x = "Term", y = "Similarity",
                      shape = 18, dot.size = 8, color = colors[1],
                      add = "segments", add.params = list(size = 2),
                      rotate = TRUE, sorting = "descending")
  parplot1 <- ggpar(plot1, title = paste("\"", word1, "\"", sep = ""),
                    legend = "none", font.x = c(14, "bold"), font.y = c(14, "bold"),
                    font.title = 16, font.family = "Arial",
                    font.xtickslab = c(12), font.ytickslab = c(14),
                    ggtheme = theme_bw()) + font("title", hjust = 0.5) + rremove("xylab")
  plot2 <- ggdotchart(query[11:20, ], x = "Term", y = "Similarity",
                      shape = 18, dot.size = 8, color = colors[2],
                      add = "segments", add.params = list(size = 2),
                      rotate = TRUE, sorting = "descending")
  parplot2 <- ggpar(plot2, title = paste("\"", word2, "\"", sep = ""),
                    legend = "none", font.x = c(14, "bold"), font.y = c(14, "bold"),
                    font.title = 16, font.family = "Arial",
                    font.xtickslab = c(12), font.ytickslab = c(14),
                    ggtheme = theme_bw()) + font("title", hjust = 0.5) + rremove("xylab")
  figure <- ggarrange(parplot1, parplot2, ncol = 2, labels = c("(A)", "(B)"), label.x = 0.1)
  annotate_figure(figure,
                  top = text_grob(paste("Which word vectors are most similar to \"",
                                        word1, "\" or \"", word2, "\"?", sep = ""),
                                  hjust = 0.5, family = "Arial",
                                  face = "bold", size = 18))
}
plotSimilarWords("business_analyst","data_analyst",complete)
Multidimensional scaling (MDS) seeks to preserve the distances between vectors. Since vector distances within GloVe encode some semantic meaning, it would be ideal to preserve the relative topology of the terms. I applied MDS with Euclidean distances between these word vectors.
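As a quick illustration of what classical MDS does, here is a toy example (the three points are made up): their pairwise distances form a 3-4-5 triangle, and cmdscale() recovers a 2-D layout that preserves those distances up to rotation and reflection.
# toy example: recover a 2-D layout from pairwise distances
pts <- matrix(c(0, 0, 3, 0, 0, 4), ncol = 2, byrow = TRUE)
d <- dist(pts) # pairwise Euclidean distances: 3, 4, 5
cmdscale(d) # coordinates whose pairwise distances match d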
# MDS with Euclidean distance
set.seed(123)
vectordata = dist(scale(t(complete)))
mds.euc <- data.frame(cmdscale(vectordata))
# Label each point with corresponding word and color according to frequency
words <- colnames(complete)
Freq_logtrans <- log10(vocab$term_count)
plotGlove <- function(mdsout, words, Freq_logtrans, metric){
  colors = Colors(2)
  library(ggplot2)
  library(plotly)
  p <- ggplot(mdsout, aes(x = X1, y = X2, size = 14)) +
    geom_label(aes(label = words, fill = Freq_logtrans, fontface = "bold")) +
    theme_bw() +
    theme(axis.line = element_line(size = 1, colour = "black"),
          panel.grid.major = element_line(colour = "#d3d3d3"),
          panel.grid.minor = element_blank(),
          panel.border = element_blank(), panel.background = element_blank(),
          plot.title = element_text(family = "Arial", size = 18, face = "bold", hjust = 0.5),
          plot.caption = element_text(family = "Arial", size = 12, face = "bold", hjust = 0.5),
          axis.text.x = element_text(colour = "black", size = 12),
          axis.text.y = element_text(colour = "black", size = 12),
          text = element_text(family = "Arial", size = 14)) +
    guides(size = FALSE) +
    scale_fill_gradient(low = colors[2], high = colors[1]) +
    labs(title = paste("MDS of Word Vectors (", metric, ")", sep = ""),
         caption = "*Based on the Indeed job summary corpus", fill = "Frequency")
  ggplotly(p)
}
plotGlove(mds.euc,words,Freq_logtrans,"Euclidean Distance")
# MDS with Cosine Distance
set.seed(123)
cosdata = 1-sim2(t(complete),t(complete),method="cosine",norm = "l2")
mds.cos <- data.frame(cmdscale(cosdata))
plotGlove(mds.cos,words,Freq_logtrans,"Cosine Distance")
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues